Abstract

This paper presents a way of solving Markov Decision Processes that combines state abstraction
and temporal abstraction. Specifically, we combine state aggregation with the options framework
and demonstrate that they work well together; indeed, it is only after one combines the two that
the full benefit of each is realized. We introduce a hierarchical value iteration algorithm where we
first coarsely solve subgoals and then use these approximate solutions to exactly solve the MDP. This
algorithm solved several problems faster than vanilla value iteration.


1 Introduction 

Finding solutions to discrete discounted Markov Decision Processes (MDPs) is an important problem
in Reinforcement Learning. The basic problem is to obtain the optimal policy of the MDP so that the
overall discounted reward obtained as we follow this policy within the MDP is maximized.¹ In this work,
we do not work with the optimal policy directly but compute the optimal value function instead.

The approach we take in this paper is to modify the well-known value iteration (VI) algorithm [2]. 
The basic idea of VI is to keep iterating the Bellman optimality equation. This is well-known to converge 
to the optimal value function. Our framework is conceptually based on a natural extension of the Bellman 
optimality equation where matrix models take the place of vector value functions. 
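
As a point of reference for the rest of the paper, plain value iteration can be sketched as below. This is a minimal illustration, not the authors' code: the function name `value_iteration`, the nested-list data layout and the two-state toy MDP are all assumptions made for the example.

```python
from typing import List


def value_iteration(P: List[List[List[float]]], R: List[List[float]],
                    gamma: float, tol: float = 1e-10,
                    max_iter: int = 10_000) -> List[float]:
    """Iterate the Bellman optimality equation until the values stop changing.

    P[a][i][j] is the probability of moving from state i to state j under
    action a; R[a][i] is the expected reward for taking action a in state i.
    """
    n = len(R[0])
    V = [0.0] * n
    for _ in range(max_iter):
        V_new = [
            max(R[a][i] + gamma * sum(P[a][i][j] * V[j] for j in range(n))
                for a in range(len(P)))
            for i in range(n)
        ]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical toy MDP (not from the paper): two states, two actions.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay in the current state
    [[0.0, 1.0], [0.0, 1.0]],  # action 1: move to (or stay in) state 1
]
R = [
    [0.0, 2.0],  # rewards of action 0 in states 0 and 1
    [1.0, 2.0],  # rewards of action 1 in states 0 and 1
]
V_star = value_iteration(P, R, gamma=0.5)
```

In this toy problem state 1 is worth 2 / (1 - 0.5) = 4 and state 0 is worth 1 + 0.5 * 4 = 3, which is what the iteration converges to.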

In order to solve large problems, table-lookup algorithms are not practical because of the sheer number
of states, which VI must loop over. Hence the need for state abstraction. For this work, we chose
aggregation [3], which can be nicely integrated into our framework of the modified Bellman optimality
equation. Algorithms based on single-step models of primitive actions are also impractical, because long
solution paths require many iterations of VI. Hence the need for temporal abstraction.² We solve this
problem via the use of options: we construct option models which can be used interchangeably
with the models we have for primitive actions.

To our knowledge, this is the first paper where an algorithm using options and value iteration
efficiently solves medium-sized MDPs (our 8-puzzle domain has 181441 states). Unlike prior work [16],
we demonstrate a modest improvement in runtime performance as well as a significant reduction in the
number of iterations. Also, we have the first convergent VI-style algorithm where options (temporal
abstraction) are combined with a framework for state abstraction, yielding far better results than the
use of either idea alone. Furthermore, our algorithm is based on a principled extension of the Bellman
equation. We emphasise that our algorithm converges to the optimal value function: although we find
approximate solutions to the subgoals, these solutions are then used as inputs to solve the original MDP
exactly, regardless of the choice of subgoals.

2 Background and Prior Work 

2.1 State Aggregation 

Consider³ [3] an MDP with $n$ states and $|A|$ actions; for an action $a$ the probability transition matrix is $P_a$, defined
by $P_a(i,j) = \gamma \Pr(i_{t+1} = j \mid i_t = i, a_t = a)$, and the vector of expected rewards for each state is $R_a$, where
the element corresponding to state $i$ is defined by $R_a(i) = E[r_t \mid i_t = i, a_t = a]$. There are $m$ aggregate
states. We introduce the aggregation [3] matrix $\Phi$ and the disaggregation [3] matrix $D$ of dimensions
$n \times m$ and $m \times n$ respectively. Under the state aggregation approximation, solving the original MDP
may be replaced by solving a much smaller aggregate MDP, by computing $\tilde{P}_a = D P_a \Phi$ and $\tilde{R}_a = D R_a$.
The solution can then be computed by any known algorithm. VI is convergent because the matrices $\tilde{P}_a$
and $\tilde{R}_a$ define a valid MDP. This gives us a value function in terms of the aggregate states.

¹ Our framework works for the discount factor $\gamma < 1$ as well as for those cases with $\gamma = 1$ where standard value iteration
converges (for example if there is a ‘sink’ state).

² Note that there is some evidence that subgoal-based hierarchical RL is similar to the processes actually taking
place in the human brain.

³ We refer the reader to the more elaborate introductory section in the appendix.
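
The compression step of this section can be sketched as follows for hard aggregation, where every original state belongs to exactly one aggregate state. The helper names (`hard_aggregation`, `compress`), the four-state chain and the assumption that no aggregate state is empty are illustrative choices, not part of the paper.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain dense matrix product."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def hard_aggregation(assign: List[int], m: int) -> Tuple[Matrix, Matrix]:
    """Build Phi (n x m) and D (m x n) from a state -> aggregate-state map.

    Each row of Phi has a single one; D is Phi transposed with rows
    renormalized to sum to one (assumes every aggregate state is non-empty).
    """
    n = len(assign)
    Phi = [[1.0 if assign[i] == x else 0.0 for x in range(m)] for i in range(n)]
    counts = [assign.count(x) for x in range(m)]
    D = [[Phi[i][x] / counts[x] for i in range(n)] for x in range(m)]
    return Phi, D


def compress(P: Matrix, R: List[float], Phi: Matrix, D: Matrix):
    """Return the aggregate transition matrix D P Phi and reward vector D R."""
    P_agg = matmul(matmul(D, P), Phi)
    R_agg = [sum(d * r for d, r in zip(row, R)) for row in D]
    return P_agg, R_agg


# Hypothetical 4-state chain; P already contains the discount 0.9.
P = [
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9],
]
R = [0.0, 0.0, 0.0, 1.0]
Phi, D = hard_aggregation([0, 0, 1, 1], m=2)
P_agg, R_agg = compress(P, R, Phi, D)
```

Each row of the compressed transition matrix still sums to the discount factor, which is why the aggregate problem remains a valid (discounted) MDP.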


2.2 Options and Matrix Models 


An option is a tuple $\langle \mu, \beta \rangle$ consisting of a policy $\mu$, mapping states to actions, as well as
a binary termination condition $\beta$, where $\beta(i)$ tells us whether the option terminates in state $i$. We will
now discuss models for options and for primitive actions. A model consists of a transition matrix
$P$ and a vector of expected rewards $R$. For a primitive action $a$, we defined $P_a$ and $R_a$ in section 2.1.
For options they have an analogous meaning. Denote by $i_0$ the starting state of the trajectory, by $\tau$ the
(random) duration of the option and by $i_\tau$ the final state. $R(i)$ is the expected total discounted reward
given the option was executed from state $i$, $R(i) = E\left[\sum_{t=0}^{\tau-1} \gamma^t r_t \mid i_0 = i\right]$. The element $P(i,i')$
is the probability of the option terminating in state $i'$, given we started in state $i$, considering the
discounting: $P(i,i') = E\left[\gamma^{\tau} 1\{i_\tau = i'\} \mid i_0 = i\right]$.

It is convenient to arrange $P$ and $R$ in a block matrix of size $(n+1) \times (n+1)$, in this way:

$$ M = \begin{bmatrix} 1 & 0 \\ R & P \end{bmatrix} $$

Now model composition corresponds to matrix multiplication, i.e. if $M_1$ and $M_2$ are block matrices, $M_1 M_2$ is also a block
matrix corresponding to first executing the option defined in $M_1$ and then the one defined in $M_2$. In
this paper, we assume that the action set $\mathcal{A} = \{A_1, \ldots, A_l\}$ is already given in this matrix format. We
introduce a similar format for value functions. The value function $V$ is represented as a vector of length
$n+1$ with 1 in the first index and the values for each state in the subsequent indices. $MV$ is a new value
function, corresponding to first executing the option defined in $M$ and then evaluating the states with
$V$. Element $i+1$ of the vector $V$ corresponds to state $i$, as does row $i+1$ of the action model. We use MATLAB
notation, i.e. $V(i+1)$ is element $i+1$ of vector $V$ and $M(i+1,:)$ is row $i+1$ of the matrix $M$.
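
The block-matrix format and the claim that composition is matrix multiplication can be checked on a tiny example. The sketch below is illustrative only; `make_model`, `matmul`, `matvec` and the two-state numbers are assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def make_model(R: List[float], P: Matrix) -> Matrix:
    """Arrange rewards R and (discounted) transitions P as [[1, 0], [R, P]]."""
    n = len(R)
    return [[1.0] + [0.0] * n] + [[R[i]] + list(P[i]) for i in range(n)]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain dense matrix product."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def matvec(M: Matrix, v: List[float]) -> List[float]:
    """Matrix-vector product, used to evaluate a model against a value vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


# Hypothetical two-state action: always ends in state 1, discount 0.5.
R = [1.0, 2.0]
P = [[0.0, 0.5], [0.0, 0.5]]
M = make_model(R, P)

# Executing the action twice: rewards become R + P R, transitions become P P.
M_twice = matmul(M, M)

# Value vector with the leading 1: states 0 and 1 are worth 0 and 10.
V = [1.0, 0.0, 10.0]
MV = matvec(M, V)
```

The first column of `M_twice` holds the two-step rewards `[2, 3]` and the remaining block holds the two-step discounted transitions; `MV` is the value of acting once and then being evaluated by `V`.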


2.3 Other Ways of Using Hierarchies to Improve Learning 

We give a brief survey of known approaches to hierarchical learning. We stress that our approach is
novel for two reasons: we compose macro-operators at run-time and we have no fixed hierarchy. This
has not been done to date, except in the work on options and VI [16], which introduced generalizations
of the Bellman equation, versions of which we use. That work did not include state abstraction, which slowed the
resulting algorithm: it only produced a decrease in the iteration count required to solve the MDP,
while we provide better solution time. Other approaches include using macro-operators to gain speed
in planning [10], but for deterministic systems only. Prior work on HEXQ [6] is largely orthogonal to
ours: it focuses on hierarchy discovery, while we describe an algorithm given the subgoals. The work
on portable options [5] only discusses a flat, fixed (unlike this work) options hierarchy. MAXQ [5] also
involves a pre-defined controller hierarchy (the MAXQ graph).⁴ Combining the use of temporal and
state abstraction was tried before, but differently from this work. The abstraction-via-statistical-testing
approach [7] only works for transfer learning: options are only constructed after the original MDP
has been solved. The U-tree approach [5] does not guarantee convergence to $V^*$ for all MDPs. The
modified LISP approach [1] uses a fixed option hierarchy and the policy obtained is only optimal given
the hierarchy, i.e. it may not be the optimal policy of the MDP without the hierarchy constraint.


3 Table-lookup Value Iteration 

We begin by describing the table-lookup algorithm for computing the value function of the MDP. It is
similar to the one described in previous work [16], but not the same: here, termination is implemented
in a different, more intuitive, way. We start with plain VI and then proceed to more complicated variants.
In MATLAB notation (see section 2.2), VI can be described as follows for state $i$.

$$ V_{(k+1)}(i+1) \leftarrow \max_a A_a(i+1,:)\, V_{(k)} \qquad (1) $$

Here, $a$ selects an action (control). We rewrite this update to construct a model corresponding to the
optimal value function; this is not useful on its own, but will come in handy later. The following is
executed for each state $i$.

$$ a \leftarrow \arg\max_a A_a(i+1,:)\, M_{(k)}\, [1, 0, \ldots, 0]^\top; \qquad M_{(k+1)}(i+1,:) \leftarrow A_a(i+1,:)\, M_{(k)} \qquad (2) $$

⁴ One can learn a MAXQ hierarchy, but only in a way when it is first learned and then applied.

We note that the multiplication $M_{(k)} [1, 0, \ldots, 0]^\top$ simply extracts the total reward in the model
(the current value function); hence eq. 2 is equivalent to plain VI. However, it serves as an important
stepping stone to introducing subgoals, which is what we do next. Assume that we are, for the moment, not
interested in maximizing the overall reward. Instead, we want to reach some other arbitrary configuration
of states defined by the subgoal vector $G$ of length $n+1$. The entry $i+1$ of $G$ defines the value we
associate with reaching state $i$. We will show later how picking such subgoals judiciously can improve
the speed of the overall algorithm. Our new update, for subgoal $G$, is the following, which we execute for
each state $i$.

$$ a \leftarrow \arg\max_a A_a(i+1,:)\, M_{(k)}\, G; \qquad M_{(k+1)}(i+1,:) \leftarrow A_a(i+1,:)\, M_{(k)} \qquad (3) $$

This iteration converges to a model $M_\infty$, which corresponds to the policy for reaching the subgoal
$G$. However, this policy executes continually: it does not stop when a state with a high subgoal value
$G(i+1)$ is reached. We will now fix that by introducing the possibility of termination: in each state,
we first determine if the subgoal can be considered to be reached and only then do we make the next
step. This is a two-stage process, given below. First, we compute the termination condition $\beta(i)$ for each
state $i$, according to the following equation.

$$ \beta_{(k)}(i) \leftarrow \arg\max_{\beta_{(k)}(i) \in [0,1]} \; \beta_{(k)}(i)\, G(i+1) + (1 - \beta_{(k)}(i))\, M_{(k)}(i+1,:)\, G \qquad (4) $$

We note that this optimization is of a linear function, therefore we will either have $\beta_{(k)}(i) = 1$ (terminate
in state $i$), or $\beta_{(k)}(i) = 0$ (do not terminate in state $i$). Conceptually, this update can be thought of as
converting the non-binary subgoal specification $G$ into a binary termination condition $\beta$. Once we have
computed $\beta_{(k)}$, we define the diagonal matrix $\beta_{(k)} = \mathrm{diag}(1, \beta_{(k)}(1), \beta_{(k)}(2), \ldots, \beta_{(k)}(n))$ as well as the
new matrix $B$ as follows.⁵

$$ B(\beta_{(k)}, M_{(k)}) = \beta_{(k)} + (I - \beta_{(k)})\, M_{(k)} $$

Here, $I$ is the identity matrix. $B$ summarizes our termination condition: it behaves like model $M_{(k)}$
for the states where we do not terminate and like the identity model for the states where we do. Once
we have this, we can define the actual update, which is executed for each state $i$.

$$ a \leftarrow \arg\max_a A_a(i+1,:)\, B(\beta_{(k)}, M_{(k)})\, G; \qquad M_{(k+1)}(i+1,:) \leftarrow A_a(i+1,:)\, B(\beta_{(k)}, M_{(k)}) \qquad (5) $$


By iterating this many times, we can obtain $M_\infty$, which will tend to go from every state to states with
high values of the subgoal $G$. The elements of $G$ are specified in the same units as the rewards, i.e.
this algorithm will go, for the non-terminating states, to a state with a particular value of the subgoal if
the value of being in the subgoal exceeds the opportunity loss of reward on the way. For the terminating
states, the model will still make one step according to the induced policy (see discussion in section 2.2).
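
The two-stage update (termination condition as in eq. 4, then the model update as in eq. 5) can be sketched on a small chain. This is a hedged illustration rather than the authors' implementation: the function names, the four-state chain, the choice `G[0] = 1` for the leading entry of the subgoal vector, starting from the identity model, and breaking ties in favour of termination are all assumptions made here.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def dot(u: List[float], v: List[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def row_times_matrix(row: List[float], M: Matrix) -> List[float]:
    """Row vector times matrix."""
    return [dot(row, list(col)) for col in zip(*M)]


def subgoal_iteration(actions: List[Matrix], G: List[float],
                      iterations: int) -> Tuple[Matrix, List[int]]:
    """Iterate the termination step and the model update for one subgoal G."""
    size = len(G)  # n + 1
    M = [[1.0 if r == c else 0.0 for c in range(size)] for r in range(size)]
    beta = [1] * (size - 1)
    for _ in range(iterations):
        # Termination: stop in state i if G(i+1) is at least the value of
        # continuing with the current model (ties terminate; an assumption).
        beta = [1 if G[i + 1] >= dot(M[i + 1], G) else 0
                for i in range(size - 1)]
        # B uses identity rows where we terminate and rows of M elsewhere.
        B = [[1.0] + [0.0] * (size - 1)]
        for i in range(size - 1):
            if beta[i]:
                B.append([1.0 if c == i + 1 else 0.0 for c in range(size)])
            else:
                B.append(M[i + 1][:])
        BG = [dot(row, G) for row in B]
        # Model update: greedy action per state, then compose its row with B.
        new_M = [B[0][:]]
        for i in range(size - 1):
            best = max(actions, key=lambda A: dot(A[i + 1], BG))
            new_M.append(row_times_matrix(best[i + 1], B))
        M = new_M
    return M, beta


def chain_action(step: int, n: int = 4, gamma: float = 0.9) -> Matrix:
    """Deterministic move by `step` on a chain of n states, zero reward."""
    rows = [[1.0] + [0.0] * n]
    for i in range(n):
        j = min(max(i + step, 0), n - 1)
        row = [0.0] * (n + 1)
        row[j + 1] = gamma
        rows.append(row)
    return rows


right, left = chain_action(+1), chain_action(-1)
G = [1.0, 0.0, 0.0, 0.0, 10.0]  # subgoal: reaching state 3 is worth 10
M, beta = subgoal_iteration([right, left], G, iterations=10)
```

After a few iterations the option terminates only in state 3, and the model row of state 0 carries the discounted probability 0.9³ of ending in state 3. The terminating state still makes one step, as noted in the text.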

There is one more way we can speed up the algorithm: through the introduction of initiation sets.
In this case, instead of selecting an action from the set of all possible actions, we only select an action
from the set of allowed actions for a given state (the initiation set). More formally, let $S_a(i)$ be a boolean
vector which has ‘true’ in the entries where action $a$ is allowed in state $i$ and ‘false’ otherwise. Equation 5
then becomes the following.

$$ a \leftarrow \arg\max_{a : S_a(i)} A_a(i+1,:)\, B(\beta_{(k)}, M_{(k)})\, G; \qquad M_{(k+1)}(i+1,:) \leftarrow A_a(i+1,:)\, B(\beta_{(k)}, M_{(k)}) \qquad (6) $$

The benefit of using initiation sets is that by not considering irrelevant actions, the whole algorithm
becomes much faster. We defer the definition of initiation sets used to section 6.3.
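
The restriction of the argmax to the initiation set can be sketched as below. The function name `best_allowed_action`, the layout of `allowed` (one boolean list per action) and the toy numbers are assumptions for illustration; the sketch also assumes that every state has at least one allowed action.

```python
from typing import List

Matrix = List[List[float]]


def dot(u: List[float], v: List[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def best_allowed_action(actions: List[Matrix], allowed: List[List[bool]],
                        BG: List[float], i: int) -> int:
    """Index of the greedy action for state i among those allowed in state i.

    BG stands for the vector B G that multiplies the action rows in the
    update; allowed[a][i] says whether action a may be started in state i.
    """
    candidates = [a for a in range(len(actions)) if allowed[a][i]]
    return max(candidates, key=lambda a: dot(actions[a][i + 1], BG))


# Hypothetical two-state example with discount 0.9.
jump = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.9], [0.0, 0.0, 0.9]]  # go to state 1
stay = [[1.0, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.9]]  # remain in place
BG = [1.0, 0.0, 10.0]
allowed = [
    [False, True],  # `jump` may not be started in state 0
    [True, True],   # `stay` is allowed everywhere
]
choice_state0 = best_allowed_action([jump, stay], allowed, BG, 0)
choice_state1 = best_allowed_action([jump, stay], allowed, BG, 1)
```

Only the allowed actions are ever evaluated, which is where the saving comes from when many macro-actions are irrelevant in most states.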

⁵ The reader will notice that our matrix $B$ can be understood to be the expected model given the termination condition,
i.e. the identity model is used where the option terminates and $M_{(k)}$ is used otherwise. However, in our algorithm it is
enough to consider it just a matrix.

Finally, we solve for several subgoals simultaneously. We use the current state of every model
in every iteration, to compute the next iteration for both itself and other models. Denote our
subgoals by $G^{(1)}, G^{(2)}, \ldots, G^{(g)}$ and the $k$-th iteration of the models trying to solve these subgoals by
$M^{(1)}_{(k)}, \ldots, M^{(g)}_{(k)}$. Define the set $\Omega_{(k)}$ as the set of all models (macro-actions) allowed at
iteration $k$, i.e. $\Omega_{(k)} = \{A_1, A_2, \ldots, A_l, M^{(1)}_{(k)}, \ldots, M^{(g)}_{(k)}\}$. This gives rise to the update given below, for
each subgoal $q$ and for each state $i$. We now compute the termination condition.

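$$ \beta^{(q)}_{(k)}(i) \leftarrow \arg\max_{\beta^{(q)}_{(k)}(i) \in [0,1]} \; \beta^{(q)}_{(k)}(i)\, G^{(q)}(i+1) + (1 - \beta^{(q)}_{(k)}(i))\, M^{(q)}_{(k)}(i+1,:)\, G^{(q)} \qquad (7) $$

Then we compute one step of the algorithm according to the following equation, where $O$ ranges over the set $\Omega_{(k)}$ of
all models (primitive actions and current subgoal models) allowed at iteration $k$.

$$ O \leftarrow \arg\max_{O \in \Omega_{(k)}} O(i+1,:)\, B(\beta^{(q)}_{(k)}, M^{(q)}_{(k)})\, G^{(q)}; \qquad M^{(q)}_{(k+1)}(i+1,:) \leftarrow O(i+1,:)\, B(\beta^{(q)}_{(k)}, M^{(q)}_{(k)}) \qquad (8) $$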

Solving several subgoals simultaneously can improve the algorithm [16]. The immediate availability of
the partial solution to every subgoal leads to faster convergence. In other words, this feature can be used
to construct the macro-operator hierarchy at run time of the algorithm.⁶ This is in contrast to many
other approaches, where the hierarchy is fixed before the algorithm is run.


4 Combining State Aggregation and Options 


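We saw in section 2.1 that given the aggregation⁷ matrix $\Phi$ and the disaggregation matrix $D$, we can
convert an action with the transition matrix $P$ and expected reward vector $R$ to an aggregate MDP by
using $\tilde{P} = D P \Phi$ and $\tilde{R} = D R$. In our matrix model notation, this becomes as follows.

$$ \tilde{A} = \begin{bmatrix} 1 & 0 \\ 0 & D \end{bmatrix} A \begin{bmatrix} 1 & 0 \\ 0 & \Phi \end{bmatrix} \quad \text{where} \quad A = \begin{bmatrix} 1 & 0 \\ R & P \end{bmatrix} \qquad (9) $$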


This can be viewed as compressing the dynamics, given our aggregation architecture $\Phi$ of size $n \times m$,
where $m$ is the number of the aggregate states. We stress that the compressed dynamics define a valid
MDP; therefore the algorithms described in the previous section are convergent.

The main idea of our algorithm is the following: define a subgoal, solve it (i.e. obtain a model for
reaching it) and then add it to the action set of the original problem and use it as a macro-action,
gaining speed. We repeat this for many subgoals. Solving subgoals is fast because we do it in the
small, aggregate state space. To be precise, we pick a subgoal $G$ (see section 6 for examples) and an
approximation architecture $\Phi$. We then compress our actions with eq. 9 and use the compressed actions in
the subgoal iteration of section 3. This gives us a model $M_\infty$ solving the subgoal in the aggregate state space.
We want to use this model to help solve the original MDP.

However, we cannot do this directly since our model $M_\infty$ is defined with respect to the aggregate state
space and has size $(m+1) \times (m+1)$; we need to find a way to convert it to a model defined over the
original state space, of size $(n+1) \times (n+1)$. The new model also has to be valid, i.e. correspond to a
sequence of actions.⁸

The idea is to make the following transformation: from the aggregate model, we compute the option
in the aggregate state space, we then up-scale the option to the original state space, construct a
one-step model and then construct the long-term model from it. Concretely, we first compute the option
corresponding to the model $M_\infty$. The option consists of the policy $\mu$ and the termination condition $\beta$.
We obtain the termination condition by applying the termination update of section 3 to the aggregate states. The policy $\mu$ is obtained
greedily for each aggregate state $x$.

$$ \mu(x) = \arg\max_c \tilde{A}_c(x+1,:)\, B(\beta, M_\infty)\, G \qquad (10) $$

Now, we can finally build a one-step model in terms of the original state-space. We do this according to
the following equation, which we use for each state $i$.

$$ M'(i+1,:) = (1 - \beta(\phi(i)))\, A_{\mu(\phi(i))}(i+1,:) + \beta(\phi(i))\, I(i+1,:) \qquad (11) $$

In the above, we denote by $I$ the identity matrix of size $(n+1) \times (n+1)$ and by $\phi(i)$ the aggregate state
corresponding to the original state $i$.⁹ In more understandable terms, $M'$ has rows selected by the policy
$\mu$ wherever the option does not terminate and rows from the identity matrix wherever it does. Now, we
do not just need a model that takes us one step towards the subgoal; we want one that takes us all the
way. Therefore, we continually evaluate the option by exponentiating the model matrix, producing $M'^{\infty}$.
Now, this new model still has rows from the identity matrix where the option terminates; therefore it
does not correspond to a valid combination of primitive actions. To solve this problem, we compute $M''$,
according to the following equation (for each state $i$).

$$ M''(i+1,:) = (1 - \beta(\phi(i)))\, M'^{\infty}(i+1,:) + \beta(\phi(i))\, A_{\mu(\phi(i))}(i+1,:) \qquad (12) $$

$M''$ contains rows from $M'^{\infty}$ where the option does not terminate and rows dictated by the option policy
where it does. This guarantees it is a valid combination of primitive actions and can be added to the
action set and treated like any other action. We now run value iteration using the extended
action set: the original actions and the subgoal models $(M'')^{(q)}$ corresponding to each subgoal $q$. This
is faster than using the original actions alone, even after factoring in the time used to compute the
subgoal models (see section 6).

⁶ By this we mean that the option models are built up in run time, possibly using other models. The subgoals are
pre-defined and constant.

⁷ In the work done in this paper, we used hard aggregation so that each row of $\Phi$ contains a one in one place and zeros
elsewhere, and the matrix $D$ is a renormalized version of $\Phi^\top$ so that the rows sum to one.

⁸ That is why it is not possible to just upscale the model by writing:
$\begin{bmatrix} 1 & 0 \\ 0 & \Phi \end{bmatrix} M_\infty \begin{bmatrix} 1 & 0 \\ 0 & D \end{bmatrix}$.

⁹ Note that the equation could be easily generalized to the case where the aggregation is soft, i.e. there are several
aggregate states corresponding to $i$, simply by summing all the possibilities as weighted by the aggregation probabilities.

Observation 1. Value Iteration with the action set $\mathcal{A} \cup \{(M'')^{(1)}, \ldots, (M'')^{(g)}\}$ converges to the optimal
value function of the MDP.

Proof outline. The addition of subgoal macro-operators to the action set does not change the fixpoint of
value iteration because the macro-operators are, by construction, compositions of the original actions.
See supplement to existing work [16] for a formal proof of a more general proposition.

This observation tells us that our algorithm will always exactly solve the MDP, computing V*. The 
worst thing that can happen is that the subgoal macro-operators will be useless i.e. the resulting value 
iteration will take as many iterations as without them. 
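
The upscaling of an aggregate option to a model over the original states (eqs. 11 and 12) can be sketched as follows for hard aggregation. The function name `upscale_option`, the use of repeated squaring to approximate the limit of the one-step model, and the four-state chain are assumptions for illustration, not details confirmed by the paper.

```python
from typing import List

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain dense matrix product."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def upscale_option(actions: List[Matrix], phi: List[int], mu: List[int],
                   beta: List[int], tol: float = 1e-12,
                   max_squarings: int = 64) -> Matrix:
    """Turn an aggregate option (mu, beta) into a model over original states.

    phi[i] is the aggregate state of original state i, mu[x] the index of the
    primitive action chosen in aggregate state x, beta[x] in {0, 1} whether
    the option terminates there.
    """
    n = len(phi)
    size = n + 1
    identity = [[1.0 if r == c else 0.0 for c in range(size)]
                for r in range(size)]
    # One-step model: policy rows where the option continues, identity rows
    # where it terminates (eq. 11).
    M1 = [identity[0][:]]
    for i in range(n):
        x = phi[i]
        M1.append(identity[i + 1][:] if beta[x] else actions[mu[x]][i + 1][:])
    # Long-term model: square repeatedly until the matrix stops changing.
    M_inf = M1
    for _ in range(max_squarings):
        nxt = matmul(M_inf, M_inf)
        diff = max(abs(a - b) for ra, rb in zip(nxt, M_inf)
                   for a, b in zip(ra, rb))
        M_inf = nxt
        if diff < tol:
            break
    # Replace identity rows by one step of the option policy (eq. 12).
    M2 = [identity[0][:]]
    for i in range(n):
        x = phi[i]
        M2.append(actions[mu[x]][i + 1][:] if beta[x] else M_inf[i + 1][:])
    return M2


def chain_action(step: int, n: int = 4, gamma: float = 0.9) -> Matrix:
    """Deterministic move by `step` on a chain of n states, zero reward."""
    rows = [[1.0] + [0.0] * n]
    for i in range(n):
        j = min(max(i + step, 0), n - 1)
        row = [0.0] * (n + 1)
        row[j + 1] = gamma
        rows.append(row)
    return rows


actions = [chain_action(+1), chain_action(-1)]  # index 0: right, 1: left
phi = [0, 0, 1, 1]   # states 0,1 -> aggregate 0; states 2,3 -> aggregate 1
mu = [0, 0]          # go right in both aggregate states
beta = [0, 1]        # terminate in aggregate state 1
M2 = upscale_option(actions, phi, mu, beta)
```

In this toy case the rows of states 0 and 1 lead to state 2 (the first state where the option terminates) with discounted weights 0.81 and 0.9, while the rows of the terminating states 2 and 3 are single steps of the primitive action chosen by the policy, so every row is a composition of primitive actions.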


5 Why not Use Linear Features 

Looking at eq. 9, one may ask if this is the best way to compress actions. It may seem that using
linear features may be better because they are more expressive and easier to come up with
than $\Phi$ and $D$. Specifically, consider the following way of compressing actions, as an alternative to eq. 9.

Define the approximation architecture $V = \Psi\theta$ for modelling value functions, the sequence of which
will converge to the optimal value function. We begin by defining the projection operator that
compresses a table-lookup model $M$ into a model that works with linear features.

$$ M^{\Psi,\Xi}(M) = \begin{bmatrix} 1 & 0 \\ 0 & \Pi \end{bmatrix} M \begin{bmatrix} 1 & 0 \\ 0 & \Psi \end{bmatrix} \qquad (13) $$


In the above, $\Pi = (\Psi^\top \Xi \Psi)^{-1} \Psi^\top \Xi$ and $\Xi$ is a diagonal matrix with entries corresponding to a
distribution over the original states of the MDP. We introduce names for the minor matrices of the models:
$M^{\Psi,\Xi}(M) = \begin{bmatrix} 1 & 0 \\ q & F \end{bmatrix}$ and $M = \begin{bmatrix} 1 & 0 \\ R & P \end{bmatrix}$. We note that eq. 13 ensures that the model $M^{\Psi,\Xi}(M)$
is the best approximation of the model $M$ in the sense that it solves the optimization problems
$F = \arg\min_F \|\Psi F - P \Psi\|_\Xi$ and $q = \arg\min_q \|\Psi q - R\|_\Xi$.¹⁰ In the above, the optimization is applied to
the transition and reward components separately; also, each column of $F$ is treated independently of the
others. The semantics of the above is as follows: each column $k$ of $F$ should be such as to make the entry
$s$ of the corresponding $k$-th column of $\Psi F$ as close as possible to the feature number $k$ of the next state,
where the index of the current state is $s$. Similarly, $\Psi q$ is picked so as to approximate the expected next
reward for each state. In other words, $F$ is a linear dynamical system that models the one-step dynamics
on features of the Markov chain corresponding to an action. One might hope that this $F$ and $q$ linear
dynamical system could be used in much the same way as the MDP compressed with state aggregation
to $\tilde{P}$ and $\tilde{R}$.

¹⁰ The norms are defined in the following way: $\|V\|_\Xi = \sqrt{V^\top \Xi V}$ and $\|A\|_\Xi = \sqrt{\mathrm{trace}(A^\top \Xi A)}$.

Figure 1: Run-times of our algorithm, plain VI and model VI. All algorithms compute V*.

Domain               plain VI    model VI    options + aggr.
Taxi (determ.)       6.43 s.     11.64 s.    4.57 s.
Taxi (stoch.)        8.30 s.     47.80 s.    4.83 s.
Hanoi (determ.)      23.45 s.    51.65 s.    11.57 s.
Hanoi (stoch.)       27.31 s.    357.52 s.   21.71 s.
8-puzzle (determ.)   100.19 s.   221.20 s.   85.94 s.

But there is a problem with the compressed models defined according to eq. 13. Consider an action
with the transition matrix $P$ and approximation architecture $\Psi$ given below.

[Equation not recoverable: the example matrices $P$, $\Psi$ and the resulting $F$.]

It can be easily shown that this, paired with a uniform distribution $\Xi$, produces the matrix $F$ given
above. But this matrix has spectrum outside of the unit circle for some $\gamma < 1$; hence if this action is
composed with itself time and again, the VI algorithm will diverge. The argument given above shows that
we cannot use eq. 13 for arbitrary features and distributions $\Xi$.¹¹ On the other hand, our framework
based on eq. 9 does not suffer from the described divergent behaviour. Also, it does not depend on any
distribution over the states, meaning there is one less parameter to the algorithm.
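
The divergence phenomenon can be reproduced with a small two-state construction. The example below is not the one used in the paper; the single feature, the transition structure and the discount 0.95 are assumptions chosen because they make the projected dynamics expand.

```python
from typing import List


def projected_transition(psi: List[float], xi: List[float],
                         P: List[List[float]]) -> float:
    """Weighted least-squares projection of P psi onto a single feature psi.

    Computes F = (psi^T Xi psi)^{-1} psi^T Xi P psi, where Xi = diag(xi)
    and P already contains the discount factor.
    """
    n = len(psi)
    denom = sum(xi[s] * psi[s] * psi[s] for s in range(n))
    P_psi = [sum(P[s][t] * psi[t] for t in range(n)) for s in range(n)]
    numer = sum(xi[s] * psi[s] * P_psi[s] for s in range(n))
    return numer / denom


gamma = 0.95
psi = [1.0, 2.0]                     # one feature: state s has value psi[s]
xi = [0.5, 0.5]                      # uniform state distribution
P = [[0.0, gamma], [0.0, gamma]]     # both states move to state 1

F = projected_transition(psi, xi, P)  # equals 1.2 * gamma

# Composing the compressed action with itself multiplies by F every time.
w = 1.0
for _ in range(50):
    w = F * w
```

Here `F` is 1.2 times the discount, so it exceeds one whenever the discount is above 5/6 even though the underlying action is a perfectly valid discounted transition. Repeated composition therefore grows without bound instead of contracting.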


6 Experiments 

We applied our approach to three domains: Taxi, Hanoi and 8-puzzle. In each case we compared several
variants of VI, including our approach combining state aggregation and options. For vanilla VI we
considered algorithms based on both eq. 1 (the familiar algorithm, denoted plain VI) and eq. 2 (model
VI, where complete models are constructed). Figure 1 summarises the solution times for each domain;
more details are given in the following domain-specific subsections. We, however, stress beforehand that
our algorithm produced a speed-up for each of the domains we tried.

6.1 The TAXI Problem 

TAXI is a prototypical example of a problem which combines spatial navigation with additional
variables. Denote the number of states as $n$ (here $n = 7000 + 1$) and the number of aggregate states as
$m$ (here $m = 25 + 1$). The additional state is the sink state.

In our first experiment, we ran four algorithms computing the same optimal value function, one for
each combination of using (or not) state aggregation and options. Consider using neither aggregation
nor options. This is model VI, one iteration of which has a complexity of $O(n^2|A| + n^3)$; in practice
it is $O(n|A|)$ because of sparsity. It takes 22 iterations to complete. Now consider the version with
subgoals but no aggregation. Here, we have 5 subgoals: one for getting to each pick-up location or the
fuel pump. An iteration now has complexity $O(g((|A| + g)n^2 + n^3))$. Because of sparsity, this becomes
$O(g((|A| + g)n + n)) = O(g(|A| + g)n)$. The algorithm needs 8 iterations less to converge, because
subgoals allow it to make jumps. However, due to the increased cost of each iteration, the time required
to converge increased. Now look at the version with aggregation (see section 4) and no options. There
are 26 aggregate states. We map each original state to one of 25 states by taking the taxi position and
ignoring other variables. The sink state (state 7001) gets mapped to the aggregate sink state (state 26).
We proceed in two stages. First, all actions are compressed (eq. 9). Then, the problem is solved using
model VI in this smaller state-space. This takes 330 iterations, but is fast because $m$ is small: the
complexity is $O(m^2|A| + m^3)$. We then obtain the value function of the aggregate system and upscale it,
then we use the new value function to obtain a greedy model (i.e. each row comes from the action that
maximizes that row times $V$), which we use as initialization in our iteration, which takes 3 iterations
less than our original algorithm. Now consider the final version, where the benefits of aggregation and
options are combined. Again, the algorithm consists of two stages. First, we use compressed actions to
compute models for getting to the five subgoals. This requires 17 iterations; the complexity of each is
$O(g((|A| + g)m^2 + m^3))$, where $g = 5$. This is fast since $m$ is small. We now upscale these models. We see
that if we add the five macro-actions, we do not need the original four actions for moving, as all sensible
movement is to one of the five locations. The algorithm now takes only 7 iterations to converge.¹²
The run-time¹³ is 6.55 s, i.e. a speed-up of 1.8 times over model VI. Results for all four versions are
summarized in figure 2. We also constructed a stochastic version of the problem, with a probability of
0.05 of staying in the original state when moving. Results are qualitatively similar and are in figure 2.
The speed-up from combining options with aggregation was greater at 7.1 times. We stress the main
result. In the deterministic case, we replace many $O(n)$ iterations with many iterations whose cost depends
only on the small number $m$ of aggregate states, followed by few $O(n)$ iterations. For stochastic problems,
where an iteration over the original state space is more expensive than $O(n)$, we likewise replace many such
iterations with many iterations over the aggregate state space followed by few iterations over the original one.

¹¹ If $\Xi$ is chosen to be the diagonal matrix with entries from the left principal eigenvector of the state transition matrix
$P$, it turns out [4] that the matrix $F$ has spectrum within the unit circle, which leads to a convergent algorithm.
However, the problem is that this approach only takes into account the long-term effects of actions. For instance in a
two-dimensional grid-world, for the action that goes right, such a distribution will be non-zero only along the rightmost
edge of the grid-world. In practice $P$ will seldom be ergodic and all information in the non-recurrent part will be lost. This
cannot lead to good overall solutions of the MDP.

¹² We need an iteration to: (1) go to the fuel pump, (2) fill in fuel, (3) go to passenger, (4) pick up passenger, (5) go to
destination, (6) drop off passenger. The 7th iteration comes from the termination condition.

¹³ This is slightly different from the result in fig. 1 since after the models have been upscaled, we can proceed either with
plain VI (as in fig. 1) or with model VI, which we do here to make the comparison fair.

Figure 2: Run-times of the algorithm in the deterministic and stochastic versions of TAXI.

deter.        no aggreg.             aggregation
no options    22 iter., 11.64 s.     330 + 19 iter., 11.73 s.
options       14 iter., 78.20 s.     17 + 7 iter., 6.55 s.

stoch.        no aggreg.             aggregation
no options    30 iter., 47.80 s.     331 + 28 iter., 26.04 s.
options       18 iter., 256.04 s.    20 + 7 iter., 6.78 s.

In our second experiment, as a digression from the main thrust of the paper, we tried a different
approach: we can use the aggregation framework to compute an approximate value function, gaining
speed. Our actions are compressed as defined by eq. 9 and we simply apply model VI or plain VI. This process gives
us a value function $\tilde{V}^*$ defined over the aggregate state space (in the first case we need to extract it
from the reward part of the model). We upscale this value function to the original states using the
equation $V = \begin{bmatrix} 1 & 0 \\ 0 & \Phi \end{bmatrix} \tilde{V}^*$. Of course, the obtained value function $V$ is only approximately optimal in
the original problem. Consider a $\Phi$ with 501 aggregate states, where the aggregation happens by eliminating
the fuel variable and leaving others intact. The algorithm used is value iteration applied to compressed
actions. It takes 2.94 s / 28 iterations to converge (determ.) and 3.08 s / 30 iterations (stoch.). The
learned value function corresponds to a policy which ignores fuel, never visits the pump, but otherwise,
if there is enough fuel, transports the passenger as intended. We have shown an important principle:
if we have an aspect of a system that we feel our solution can ignore, we can eliminate it and still get an
approximate solution. The benefit is in the speed-up; in our case, with respect to solving the original
MDP using plain VI, it is 2.2 (determ.) / 2.7 (stoch.).
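
Upscaling an aggregate value function to the original states amounts to copying the value of each aggregate state to all of its members when the aggregation is hard. A minimal sketch, with the function name `upscale_values` and the toy numbers as assumptions:

```python
from typing import List


def upscale_values(V_agg: List[float], phi: List[int]) -> List[float]:
    """Apply the block matrix [[1, 0], [0, Phi]] to an aggregate value vector.

    V_agg has length m + 1 with a leading 1; phi[i] is the aggregate state of
    original state i (hard aggregation, so each row of Phi has a single one).
    """
    m = len(V_agg) - 1
    Phi = [[1.0 if phi[i] == x else 0.0 for x in range(m)]
           for i in range(len(phi))]
    return [V_agg[0]] + [
        sum(Phi[i][x] * V_agg[x + 1] for x in range(m))
        for i in range(len(phi))
    ]


# Hypothetical example: four original states, two aggregate states.
V = upscale_values([1.0, 2.0, 5.0], phi=[0, 0, 1, 1])
```

All original states that share an aggregate state receive the same value, which is why the resulting policy cannot react to any variable that the aggregation removed (fuel, in the Taxi experiment).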


6.2 The Towers of Hanoi 

For $r$ disks, our state representation in the Towers of Hanoi is an $r$-tuple, where each element corresponds
to a disk and takes values from {1,2,3}, denoting the peg.¹⁴ There are three actions, two for moving
the smallest disk and one for moving a disk between the remaining two pegs. It is known that VI for
this problem takes $2^r$ iterations to converge. To speed up the iteration, we introduced the following
state abstraction. There are $r - 2$ sub-problems of size $2, \ldots, r - 1$. First, we solve the problem with 2
disks, i.e. our abstraction only considers the position of the two smallest disks, ignoring the rest. There
are three subgoals, one for placing the two disks on each of the pegs. Then, once we have obtained three
models for the subgoals, we use them to solve the sub-problem of size 3, ignoring all disks except the
three smallest ones. Again, there are three subgoals. We proceed until we solve the problem with $r$
disks. For each subgoal, we need 4 iterations (three moves, and the 4th is required for the convergence
criterion). The total number of iterations is $4 \times 3 \times r$, i.e. it is linear in the number of disks. For 8 disks this
means the following speed-up: 11.57 s (with subgoals) vs. 51.65 s (model VI) vs. 23.45 s (plain VI). We
note however, that the time complexity of the algorithm with subgoals is still exponential in $r$, because
whereas the number of iterations is only linear, in each iteration we need to iterate the whole state space,
which is exponential.¹⁵ For a stochastic version, the run-times were 357.52 s for model VI, 27.31 s for
plain VI and 21.71 s for computing the same optimal value function with options with aggregation.
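
The iteration counts quoted above can be tabulated for a given number of disks. The sketch below only restates the arithmetic of the text; the function name `hanoi_counts` and the dictionary keys are illustrative.

```python
from typing import Dict


def hanoi_counts(r: int) -> Dict[str, int]:
    """Sizes quoted in the text for the Towers of Hanoi with r disks."""
    return {
        "states": 3 ** r,                   # each disk sits on one of 3 pegs
        "plain_vi_iterations": 2 ** r,      # iterations of VI without subgoals
        "subgoal_iterations": 4 * 3 * r,    # 4 iterations x 3 subgoals x r
    }


counts = hanoi_counts(8)
```

For 8 disks this gives 6561 states, 256 iterations for VI without subgoals and 96 iterations with subgoals. Each iteration still sweeps all states, so the overall cost remains exponential in the number of disks.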
Figure 3: The subgoal used and run-times for the 8-puzzle. All algorithms compute V*.

                        iter.   time elapsed
model VI                32      221.20 s.
plain VI                33      100.19 s.
subgoal                 25
subgoal w. init. set    25


6.3 The 8-puzzle 


The 8-puzzle [14, 18] is well-known in the planning community. Our subgoal is shown in figure 3. ‘A’, 
‘B’, and ‘C’ denote groups of tiles. The subgoal consists in arranging the tiles so that each group is in 
its correct place (tiles within a group may still occupy incorrect positions). The matrix $\Phi$ maps 
each original configuration of the tiles onto one where each tile is marked only with the group it belongs 
to. Using the subgoal alone did not result in a speed-up, so we used the notion of initiation sets [19]. 
We trained the subgoal for 9 iterations (the number 9 was obtained by trial and error), so the obtained 
model can reach the subgoal only from some starting states (the ones at most 9 primitive actions away 
from the subgoal). We upscaled the model, but this time the new model had an initiation set containing 
only those states from which the subgoal is reachable. The iteration we then used is plain value iteration, 
extended to initiation sets. The intuition behind initiation sets is that it only makes sense to use a 
subgoal if we are already in a part of the state space close to it. Thus, we obtained a total run-time of 
85.94 seconds, which amounts to a speed-up of 1.17 over plain value iteration. The results are in figure 3. 
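
The extension of value iteration to initiation sets can be sketched as follows. This is a minimal illustration 
on a hypothetical six-state chain, not the 8-puzzle code used in the experiments. Each model is a triple 
(P, R, init), where the discount is assumed to be folded into P, and a model takes part in the max at state s 
only if s lies in its initiation set (init = None means the model applies everywhere, as for primitive 
actions). The names `value_iteration` and `chain_models`, and the toy problem itself, are assumptions made 
for the example.

```python
def value_iteration(models, n_states, tol=1e-9, max_iter=1000):
    """Value iteration over a list of (P, R, init) models.

    P is an n x n list of rows with the discount already folded in,
    R is a length-n reward vector, and init is a set of states
    (or None if the model may be used in every state).
    """
    V = [0.0] * n_states
    for it in range(1, max_iter + 1):
        V_new = [
            max(
                R[s] + sum(p * v for p, v in zip(P[s], V))
                for P, R, init in models
                if init is None or s in init  # respect initiation sets
            )
            for s in range(n_states)
        ]
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new, it
        V = V_new
    return V, max_iter


def chain_models(n=6, gamma=0.9, init=frozenset({1, 2, 3, 4})):
    """Toy chain: 'right' moves s -> s+1; entering the last state pays 1."""
    goal = n - 1
    P = [[0.0] * n for _ in range(n)]
    R = [0.0] * n
    for s in range(goal):
        P[s][s + 1] = gamma
        R[s] = 1.0 if s + 1 == goal else 0.0
    primitive = (P, R, None)

    # Option model: run 'right' until the goal, usable only inside `init`
    # (`init` must not contain the goal state itself).
    P_o = [[0.0] * n for _ in range(n)]
    R_o = [0.0] * n
    for s in init:
        d = goal - s                 # primitive steps to the goal
        P_o[s][goal] = gamma ** d
        R_o[s] = gamma ** (d - 1)    # discounted reward collected on arrival
    return primitive, (P_o, R_o, init)


primitive, option = chain_models()
V_plain, it_plain = value_iteration([primitive], 6)
V_opt, it_opt = value_iteration([primitive, option], 6)
print(it_plain, it_opt)
```

In this toy problem both runs agree on the value function, and the run with the option needs fewer sweeps. 
States outside the initiation set simply fall back on primitive actions. 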


7 Conclusions 

We introduced new Bellman optimality equations that facilitate VI with options. These equations can be 
combined with state aggregation in a sound way, and therefore can be applied to the solution of 
medium-sized MDPs. This is the first algorithm combining options and state abstraction which is guaranteed 
to converge. This is notable because other proposed approaches, notably based on linear features, are 
known to diverge even for small problems. Finally, we have shown experimentally that the benefits of 
options and state aggregation are only realized when they are applied together. 


References 

[1] D. Andre and S. J. Russell. State abstraction for programmable reinforcement learning agents. 
In AAAI Conference on Artificial Intelligence / Annual Conference on Innovative Applications of 
Artificial Intelligence, pages 119-125, 2002. 

[2] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1957. 

[3] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, 
2012. 

[4] D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration 
and temporal-difference learning. Journal of Optimization Theory and Applications, 105:589-608, 
2000. 

[5] T. G. Dietterich. The MAXQ Method for Hierarchical Reinforcement Learning. In International 
Conference on Machine Learning, pages 118-126, 1998. 

[6] B. Hengst. Discovering hierarchy in reinforcement learning with HEXQ. In International Conference 
on Machine Learning, volume 2, pages 243-250, 2002. 

Footnote: If the number of subgoals and actions is constant. 

Footnote: Note that the state representation itself disallows placing a larger disk on top of a smaller one. 

Footnote: However, this problem is not particular to our approach: every algorithm that purports to compute 
the value function for each state will have computational complexity at least as high as the number of such states. 

Footnote: Other subgoals are shown in the documentation accompanying the source code. Please also consult the 
source code, where all subgoals are implemented. 

Footnote: We provide software used in our experiments under GPL in the hope that others may use it for their problems. 



[7] N. K. Jong and P. Stone. State Abstraction Discovery from Irrelevant State Variables. In International 
Joint Conferences on Artificial Intelligence, pages 752-757, 2005. 

[8] A. Jonsson and A. G. Barto. Automated state abstraction for options using the U-tree algorithm. 
Advances in Neural Information Processing Systems, pages 1054-1060, 2001. 

[9] G. Konidaris and A. G. Barto. Building Portable Options: Skill Transfer in Reinforcement Learning. 
In International Joint Conferences on Artificial Intelligence, volume 7, pages 895-900, 2007. 

[10] R. Korf. Learning to Solve Problems by Searching for Macro-Operators. Research Notes in Artificial 
Intelligence, Vol 5. Pitman, 1985. 

[11] D. J. Lizotte. Convergent fitted value iteration with linear function approximation. In J. Shawe-Taylor, 
R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information 
Processing Systems 24, pages 2537-2545. 2011. 

[12] R. Parr, L. Li, G. Taylor, G. Painter-Wakefield, and M. L. Littman. An analysis of linear models, 
linear value-function approximation, and feature selection for reinforcement learning. In Proceedings 
of the 25th International Conference on Machine Learning, ICML ’08, pages 752-759, New York, NY, 
USA, 2008. ACM. 

[13] D. Precup, R. S. Sutton, and S. Singh. Theoretical results on reinforcement learning with temporally 
abstract options. In Machine Learning: ECML-98, volume 1398 of Lecture Notes in Computer 
Science, pages 382-393. Springer Berlin Heidelberg, 1998. 

[14] A. Reinefeld. Complete solution of the eight-puzzle and the benefit of node ordering in IDA*. In 
International Joint Conference on Artificial Intelligence, pages 248-253, 1993. 

[15] J. J. Ribas-Fernandes, A. Solway, C. Diuk, J. T. McGuire, A. G. Barto, Y. Niv, and M. M. Botvinick. 
A neural signature of hierarchical reinforcement learning. Neuron, 71(2):370-379, 2011. 

[16] D. Silver and K. Ciosek. Compositional planning using optimal option models. In 29th International 
Conference on Machine Learning, 2012. 

[17] J. Sorg and S. Singh. Linear options. In Proceedings of the 9th International Conference on 
Autonomous Agents and Multiagent Systems: Volume 1, AAMAS ’10, pages 31-38, Richland, SC, 2010. 
International Foundation for Autonomous Agents and Multiagent Systems. 

[18] W. E. Story. Notes on the “15” puzzle. American Journal of Mathematics, 2(4):397-404, 1879. 

[19] R. Sutton, D. Precup, and S. Singh. Between MDPs and Semi-MDPs: A Framework for Temporal 
Abstraction in Reinforcement Learning. Artificial Intelligence, 112:181-211, 1999. 

[20] R. S. Sutton. TD Models: Modeling the World at a Mixture of Time Scales. In Proceedings of the 
Twelfth International Conference on Machine Learning, pages 531-539. Morgan Kaufmann, 1995. 

[21] B. Van Roy. TD(0) Leads to Better Policies than Approximate Value Iteration. In Y. Weiss, 
B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 
1377-1384. 2005. 

[22] H. Wang, W. Li, and X. Zhou. Automatic discovery and transfer of MAXQ hierarchies in a complex 
system. In ICTAI, pages 1157-1162, 2012. 


8 Appendix 

In this appendix, we discuss background information concerning state aggregation for MDPs, adapted to 
the notation of our paper. This is necessary because Bertsekas’ original notation is difficult to apply to 
our work. We stress that the ideas presented in this appendix are entirely due to Bertsekas [3]. 

We are concerned with an MDP which has $|A|$ actions and $n$ states. For an action $a$, the probability 
transition matrix is $P_a$, defined by $P_a(i,j) = \gamma \Pr(i_{t+1} = j \mid i_t = i, a_t = a)$, and the vector of 
expected rewards for each state is $R_a$, defined by $R_a(i) = \mathbb{E}[r_t \mid i_t = i, a_t = a]$. There are $m$ 
aggregate states. In addition, we introduce two matrices defining the approximation architecture: the 
aggregation matrix $\Phi$ and the disaggregation matrix $D$ (we employ the names introduced by Bertsekas [3]). 
The matrix $\Phi$ has dimensions $n \times m$ and the matrix $D$ has dimensions $m \times n$. It is useful to think 
about these matrices as conversion operators: the matrix $\Phi$ converts a value function defined over the 
aggregate states into one defined over the original states; conversely, the matrix $D$ converts a value 
function defined over the original states into one defined over the aggregate states. 
There are no conditions on these matrices other than that their rows have to sum to one, as they are 
probability distributions modeling, for $\Phi$, the degree to which each state is represented by various 
aggregate states and, for $D$, the degree to which a certain aggregate state corresponds to various original 
states. Having defined the matrices, we can define our first approximation step. The Bellman optimality 
operator in the original MDP is called $T$, and is defined by $(TV)(i) = \max_a \left[ (P_a V)(i) + R_a(i) \right]$, 
and the optimal value function $V^*$ satisfies the fixed-point equation $V^* = TV^*$. Now, the approximation 
consists in solving the following equation instead (we will see later that this is not solved exactly and 
further approximation is necessary). 

$$\tilde{V}^* = D\,T(\Phi \tilde{V}^*) \qquad (14)$$

In the above, we use a tilde to denote the aggregate problem. We note that this equation operates on a shorter 
value function: $\tilde{V}^*$ has entries corresponding to aggregate states. The idea is, of course, that the number 
of aggregate states is tractable, so we can compute $\tilde{V}^*$. However, we need to reformulate the equation 
since in its present form it contains the operator $T$, which still operates on the original states. To do so, 
we expand the definition of $T$, to obtain the following state-wise equation, for the aggregate state $x$. 

$$\tilde{V}^*(x) = \sum_i d_{xi} \left( \max_a \left[ P_a(i,:)\,\Phi \tilde{V}^* + R_a(i) \right] \right)$$


This equation leads to the following iterative algorithm, which computes $\tilde{V}^*$ as $k \to \infty$. 

$$\tilde{V}_{(k+1)}(x) = \sum_i d_{xi} \left( \max_a \left[ P_a(i,:)\,\Phi \tilde{V}_{(k)} + R_a(i) \right] \right)$$

In the above, $P_a(i,:)$ denotes row $i$ of the probability transition matrix corresponding to 
action $a$ (in terms of the original states), and $d_{xi}$ denotes the entry of $D$ in row $x$ and column $i$. 
Value functions are assumed to be column vectors. In order to be able to operate exclusively with objects 
that have dimensionality corresponding to the number of aggregate states, we introduce a second 
approximation: we move the max operator outside of the sum. 

$$\tilde{V}_{(k+1)}(x) = \max_a \sum_i d_{xi} \left( P_a(i,:)\,\Phi \tilde{V}_{(k)} + R_a(i) \right)$$

We note that this approximation is exact if states mapping to a single aggregate state all have the same 
optimal action. Now, we can reformulate the equation in the following way. 

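$$\tilde{V}_{(k+1)}(x) = \max_a \left[ D(x,:)\,P_a \Phi \tilde{V}_{(k)} + D(x,:)\,R_a \right]$$

$$\phantom{\tilde{V}_{(k+1)}(x)} = \max_a \left[ (\tilde{P}_a \tilde{V}_{(k)})(x) + \tilde{R}_a(x) \right] \qquad (15)$$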


In the above, $D(x,:)$ denotes the row of $D$ corresponding to aggregate state $x$ and $P_a$ is the probability 
transition matrix corresponding to action $a$ in the original MDP. Now, we note that solving the above 
equation is equivalent to solving a modified MDP with actions corresponding to the original actions, 
probability transition matrices given by $\tilde{P}_a = D P_a \Phi$ and expected reward vectors given by 
$\tilde{R}_a = D R_a$. The states of the modified MDP are the aggregate states. 
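
The construction of the aggregate MDP can be sketched in a few lines. The example below assumes hard 
aggregation of four original states into two aggregate states with uniform disaggregation weights, and a 
made-up two-action MDP with the discount folded into each $P_a$. The helper names (`matmul`, `matvec`, 
`aggregate_mdp`) are illustrative and are not part of the paper's software.

```python
def matmul(A, B):
    """Dense matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, v):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def aggregate_mdp(P, R, Phi, D):
    """Return the aggregate models D P_a Phi and D R_a for every action a."""
    P_agg = {a: matmul(matmul(D, P[a]), Phi) for a in P}
    R_agg = {a: matvec(D, R[a]) for a in R}
    return P_agg, R_agg


GAMMA = 0.9
# Hypothetical 4-state MDP: 'stay' keeps the state, 'move' cycles 0->1->2->3->0.
P = {
    "stay": [[GAMMA if i == j else 0.0 for j in range(4)] for i in range(4)],
    "move": [[GAMMA if j == (i + 1) % 4 else 0.0 for j in range(4)] for i in range(4)],
}
R = {"stay": [0.0, 0.0, 0.0, 0.0], "move": [0.0, 0.0, 0.0, 1.0]}

# Hard aggregation: states {0, 1} -> aggregate 0, states {2, 3} -> aggregate 1.
Phi = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # n x m, rows sum to 1
D = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]        # m x n, rows sum to 1

P_agg, R_agg = aggregate_mdp(P, R, Phi, D)
print(P_agg["move"], R_agg["move"])
```

Because the rows of $D$ and $\Phi$ sum to one, each row of an aggregate transition matrix sums to the same 
discount as in the original model, so the aggregate models can be handed to plain VI unchanged. 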

Therefore, under our two explained approximations, solving the original MDP may be replaced by 
solving a much smaller aggregate MDP, by computing $\tilde{P}_a$ and $\tilde{R}_a$. The solution can then be 
computed by any known algorithm, although in this paper we focus only on VI. We emphasize that the VI is 
convergent because the matrices $\tilde{P}_a$ and $\tilde{R}_a$ define a valid MDP. We stress again that this 
involves two approximations: first, we are solving a modified Bellman equation (14) that utilizes state 
aggregation, and second, we move the max operator outside of the sum in equation (15). 
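
The effect of the second approximation can be checked numerically on a single backup. The sketch below uses a 
hypothetical problem with two original states merged into one aggregate state, and compares the backup with 
the max inside the sum against the backup with the max outside the sum (equation 15). All names and numbers 
are assumptions chosen for illustration.

```python
def q_values(P, R, Phi, V_agg):
    """Per-action backups P_a(i,:) Phi V + R_a(i) for every original state i."""
    V = [sum(f * v for f, v in zip(row, V_agg)) for row in Phi]  # Phi V
    return {
        a: [R[a][i] + sum(p * v for p, v in zip(P[a][i], V)) for i in range(len(V))]
        for a in P
    }


def backup_max_inside(P, R, Phi, D, V_agg):
    """sum_i d_xi max_a [...]: first approximation only."""
    q = q_values(P, R, Phi, V_agg)
    best = [max(q[a][i] for a in q) for i in range(len(Phi))]
    return [sum(d * b for d, b in zip(row, best)) for row in D]


def backup_max_outside(P, R, Phi, D, V_agg):
    """max_a sum_i d_xi [...]: both approximations, as in equation 15."""
    q = q_values(P, R, Phi, V_agg)
    return [max(sum(d * x for d, x in zip(row, q[a])) for a in q) for row in D]


GAMMA = 0.9
Phi = [[1.0], [1.0]]  # two original states, one aggregate state
D = [[0.5, 0.5]]
P = {a: [[GAMMA, 0.0], [0.0, GAMMA]] for a in ("a", "b")}  # discount folded in
V_agg = [1.0]

# Different optimal actions in the two states: the backups disagree.
R_mixed = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
inside_mixed = backup_max_inside(P, R_mixed, Phi, D, V_agg)
outside_mixed = backup_max_outside(P, R_mixed, Phi, D, V_agg)

# Same optimal action in both states: the backups coincide.
R_same = {"a": [1.0, 2.0], "b": [0.0, 0.0]}
inside_same = backup_max_inside(P, R_same, Phi, D, V_agg)
outside_same = backup_max_outside(P, R_same, Phi, D, V_agg)
print(inside_mixed, outside_mixed, inside_same, outside_same)
```

Moving the max outside can only lower the backup, since a single action must then serve every state in the 
aggregate. The two backups agree when one action is optimal for all of those states. 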